[pull] master from ray-project:master by pull[bot] · Pull Request #1069 · garymm/ray

pull · 2026-06-12T01:18:14Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

#63890)

obstore's S3Store defaults region to us-east-1 and does not follow AWS PermanentRedirect responses, so any obstore-routed S3 request against a bucket in a different region fails non-retryably with BareRedirect.

- `_split_obstore_uri` rewrites `https://s3.<region>.amazonaws.com/<bucket>/<key>` to `s3://<bucket>` + `<key>` so StoreRegistry can apply region discovery.
- `_discover_aws_bucket_region` resolves a bucket's region via `pyarrow.fs.resolve_s3_region` (already a required Ray Data dependency), cached per bucket. PyArrow issues the `x-amz-bucket-region` HEAD probe and handles the legacy global endpoint / IMDS edge cases; we additionally cache negative results so unresolvable buckets are probed at most once. The probe runs outside the cache lock, and the write-back never lets a `None` result overwrite a region a concurrent thread already cached (a real region always wins), so racing first-time lookups can't intermittently disable region injection.
- `StoreRegistry.get` injects the discovered region for `s3://` and `s3a://` URLs, skipping injection when the caller already supplied a region or a custom endpoint (MinIO/R2/etc.).
- All obstore call sites (the HEAD size probe `_resolve_size`, the actor HEAD path `_head_one`, ranged downloads `_fetch_ranged`, and whole-file GET `_fetch`) go through `_split_obstore_uri`, so a path-style cross-region URL no longer slips past the rewrite. Previously the size probe stayed on the regional HTTPS store, returned size 0, and wrongly skipped ranged downloads.

GCS and Azure are unaffected: neither encodes region in the endpoint (GCS uses a global endpoint addressed by bucket name; Azure is keyed by storage account), so they have no cross-region redirect failure mode.

---------

Signed-off-by: Goutam <goutam@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
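The URI rewrite and the per-bucket region cache described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `split_path_style_s3`, `RegionCache`, and the injected `probe` callable are hypothetical names, and the regex is an assumed pattern for path-style regional URLs. In the real change the probe is `pyarrow.fs.resolve_s3_region`; here it is passed in so the sketch runs without network access.

```python
import re
import threading
from typing import Callable, Dict, Optional, Tuple

# Assumed pattern for path-style regional S3 URLs (hypothetical, for illustration).
_PATH_STYLE = re.compile(r"^https://s3\.([a-z0-9-]+)\.amazonaws\.com/([^/]+)/(.+)$")


def split_path_style_s3(uri: str) -> Tuple[str, str]:
    """Return (store_url, key), rewriting path-style HTTPS URLs to s3://<bucket>."""
    match = _PATH_STYLE.match(uri)
    if match:
        _region, bucket, key = match.groups()
        return f"s3://{bucket}", key
    # Any other URI: split into scheme://bucket and key without rewriting.
    scheme, rest = uri.split("://", 1)
    bucket, _, key = rest.partition("/")
    return f"{scheme}://{bucket}", key


class RegionCache:
    """Per-bucket region cache that also remembers negative (None) results."""

    def __init__(self, probe: Callable[[str], Optional[str]]):
        self._probe = probe  # e.g. pyarrow.fs.resolve_s3_region in real code
        self._lock = threading.Lock()
        self._cache: Dict[str, Optional[str]] = {}

    def get(self, bucket: str) -> Optional[str]:
        with self._lock:
            if bucket in self._cache:
                return self._cache[bucket]
        # The probe runs outside the lock so slow lookups do not serialize callers.
        try:
            region = self._probe(bucket)
        except Exception:
            region = None  # treat an unresolvable bucket as a negative result
        with self._lock:
            existing = self._cache.get(bucket)
            if existing is not None:
                # A concurrent thread already cached a real region: it wins.
                return existing
            self._cache[bucket] = region
            return region
```

With a fake probe, a second lookup for the same bucket (resolved or not) is served from the cache, so each bucket is probed at most once in the single-threaded case.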

…tream ray.get (#64014)

`_next_sync` documents *"if an object is not available within the given timeout, it returns a nil object reference"*, but its end-of-stream handling calls `ray.get(generator_ref)` with **no timeout** (to distinguish a normal end of the stream from a task failure).

The bug: after all yielded refs are consumed, the next `_next_sync(timeout_s=...)` call reaches that get. The generator's return object normally resolves locally, but if it lives in plasma and its node died, the get blocks the calling thread until lineage reconstruction re-runs the task, which needs a free CPU. On a saturated cluster this can deadlock: the blocked caller (e.g. Ray Data's scheduling thread) is what consumes outputs and releases the CPUs held by output-backpressured tasks, so reconstruction can never start. Serve's `to_object_ref(timeout_s=...)` similarly blocks past the user's requested timeout when a replica node dies.

- Apply the caller's `timeout_s` to the end-of-stream get; report a timeout as a nil ref (retry), per the documented contract.
- `timeout_s=None` (and `-1`) keep the blocking behavior, so `__next__` and other timeout-less callers are unchanged.
- Regression test: stream exhausted + return object lost with its node → `_next_sync(timeout_s=0)` returns nil instead of blocking (hangs forever without the fix), and the stream terminates normally once the node is restored.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
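The timeout handling described above can be illustrated without a Ray cluster. The sketch below uses a standard-library `concurrent.futures.Future` as a stand-in for the generator's return object; `StreamEnd` and `end_of_stream_get` are hypothetical names, not Ray APIs, and the real fix applies the timeout to `ray.get` rather than to a `Future`.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout
from enum import Enum
from typing import Optional


class StreamEnd(Enum):
    TIMED_OUT = "timed_out"  # caller should treat this like a nil ref and retry
    FINISHED = "finished"    # the stream ended normally


def end_of_stream_get(return_obj: Future, timeout_s: Optional[float]) -> StreamEnd:
    """Wait for the generator's return object, honoring the caller's timeout.

    `None` and `-1` keep the blocking behavior; any other value bounds the wait.
    A task failure stored in the future is re-raised by `result()`.
    """
    timeout = None if timeout_s is None or timeout_s == -1 else timeout_s
    try:
        return_obj.result(timeout=timeout)
    except FutureTimeout:
        return StreamEnd.TIMED_OUT
    return StreamEnd.FINISHED
```

A pending return object with `timeout_s=0` yields `StreamEnd.TIMED_OUT` immediately instead of blocking, while a resolved one yields `StreamEnd.FINISHED` and a failed one raises the stored exception.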

…64034)

The Windows base image build (ci/ray_ci/windows/build_base.sh) crashes when running `conda update -c conda-forge ca-certificates certifi`:

`AttributeError: module 'lib' has no attribute 'X509_V_FLAG_CB_ISSUER_CHECK'`

Upgrading the conda base env to python 3.10 (`conda install python=...`) pulls cryptography>=38, which removed `_lib.X509_V_FLAG_CB_ISSUER_CHECK`. pyopenssl is not part of that transaction, so the stale py3.8-era pyopenssl is left behind and still references the removed attribute at import. The next conda invocation imports requests -> urllib3.contrib.pyopenssl -> OpenSSL.crypto and crashes before conda can run, failing the base image build.

Co-resolve pyopenssl 23.2.0 in the same conda install transaction so it stays compatible with the cryptography 38.x that gets installed.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… id (#64044) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…in (#64021)

## Why are these changes needed?

`RAY_SERVE_PORT_QUARANTINE_S` holds a released direct-ingress replica port out of the allocation pool so that stale routing state pointing at the old replica drains before another replica can inherit the port. It currently defaults to **10 seconds**.

The consumers that hold stale routing state the longest are soft-stopped (reloaded-out) HAProxy worker processes: they run no health checks (see [haproxy#3330](haproxy/haproxy#3330)) and keep routing to their frozen server list until `hard-stop-after` fires, which is **120s by default** (`RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S`) and commonly configured higher.

With the current defaults the quarantine is 12x shorter than the window it exists to outlive: a freed port can be handed to a *different app's* replica at +10s while old workers keep sending it the previous app's traffic for up to +120s.

Observed in sustained load testing: a just-freed direct-ingress port was recycled into another app's replica inside the stale-worker window, and a soft-stopped worker routed the old app's traffic to it, surfacing as unretried wrong-app 404s at the client. Health checks cannot catch this (they validate the address is serving, not which app is serving).

## What does this change do?

Derives the default quarantine from the hard-stop window instead of a fixed 10s:

```python
RAY_SERVE_PORT_QUARANTINE_S = get_env_float_non_negative(
    "RAY_SERVE_PORT_QUARANTINE_S",
    float(RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S + 30),
)
```

The `+30s` margin covers the broadcast/coalesce/reload latency that elapses before an old worker's hard-stop clock starts (the clock runs from the worker's *orphaning* at reload, which can lag the port release). An explicit `RAY_SERVE_PORT_QUARANTINE_S` still overrides, and `0` still disables quarantining entirely.

Sizing rule this encodes (must hold for correctness, now holds by default):

```
port quarantine >= hard-stop-after + reload propagation lag
```

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
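To make the arithmetic of the sizing rule above concrete, here is a small self-contained sketch. `read_non_negative_float` is a hypothetical stand-in for Ray's `get_env_float_non_negative`, whose exact behavior is not shown in this PR, and reading `RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S` through the same helper is an assumption made for illustration. The environment is passed in as a plain mapping so the example does not depend on process state.

```python
from typing import Mapping

HARD_STOP_AFTER_DEFAULT_S = 120.0  # default stated in the PR description
RELOAD_LAG_MARGIN_S = 30.0         # the +30s margin from the PR description


def read_non_negative_float(env: Mapping[str, str], name: str, default: float) -> float:
    """Hypothetical helper: parse a non-negative float, falling back to `default`."""
    raw = env.get(name)
    if raw is None:
        return default
    value = float(raw)
    if value < 0:
        raise ValueError(f"{name} must be non-negative, got {value}")
    return value


def port_quarantine_s(env: Mapping[str, str]) -> float:
    """Derive the quarantine from the hard-stop window unless explicitly overridden."""
    hard_stop = read_non_negative_float(
        env, "RAY_SERVE_HAPROXY_HARD_STOP_AFTER_S", HARD_STOP_AFTER_DEFAULT_S
    )
    return read_non_negative_float(
        env, "RAY_SERVE_PORT_QUARANTINE_S", hard_stop + RELOAD_LAG_MARGIN_S
    )
```

Under these assumptions an empty environment yields 150 seconds, raising the hard-stop window to 300 seconds yields 330, and an explicit `RAY_SERVE_PORT_QUARANTINE_S` (including `0`) is returned unchanged.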

goutamvenkat-anyscale and others added 5 commits June 11, 2026 19:26

[serve.llm] Don't clobber an explicitly-set request_id with the Serve…

0b3944d


pull Bot locked and limited conversation to collaborators Jun 12, 2026

pull Bot added the ⤵️ pull label Jun 12, 2026

pull Bot merged commit d379709 into garymm:master Jun 12, 2026

pull[bot] merged 5 commits into garymm:master from ray-project:master
